The dataset we will be working with represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test result, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.
The dataset is avaliable from the UCI Machine Learning Repository.
The project is a 2-way classification task in which we try to predict if a patient will either
# Reading in data
data <- read.csv("dataset/diabetic_data.csv")
# remove duplicates
data <- data[!duplicated(data$patient_nbr), ]
# summary of the data
str(data)
## 'data.frame': 71518 obs. of 50 variables:
## $ encounter_id : int 2278392 149190 64410 500364 16680 35754 55842 63768 12522 15738 ...
## $ patient_nbr : int 8222157 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 ...
## $ race : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
## $ age : Factor w/ 10 levels "[0-10)","[10-20)",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weight : Factor w/ 10 levels "?","[0-25)","[100-125)",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ admission_type_id : int 6 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: int 25 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : int 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : int 1 3 2 2 1 3 4 5 13 12 ...
## $ payer_code : Factor w/ 18 levels "?","BC","CH",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ medical_specialty : Factor w/ 73 levels "?","AllergyandImmunology",..: 39 1 1 1 1 1 1 1 1 20 ...
## $ num_lab_procedures : int 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : int 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : int 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : int 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 0 1 0 0 0 0 0 0 0 ...
## $ diag_1 : Factor w/ 717 levels "?","10","11",..: 126 145 456 556 56 265 265 278 254 284 ...
## $ diag_2 : Factor w/ 749 levels "?","11","110",..: 1 81 80 99 26 248 248 316 262 48 ...
## $ diag_3 : Factor w/ 790 levels "?","11","110",..: 1 123 768 250 88 88 772 88 231 319 ...
## $ number_diagnoses : int 1 9 6 7 5 9 7 8 8 8 ...
## $ max_glu_serum : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nateglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ chlorpropamide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glimepiride : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ acetohexamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
## $ tolbutamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
## $ acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ miglitol : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ tolazamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ examide : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ citoglipton : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
## $ glyburide.metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glipizide.metformin : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.pioglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ change : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ readmitted : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
Before moving on to do some data cleaning, let’s observe the descriptive statistics of each column in the training dataset.
The skimr package provides a nice solution to show key descriptive stats for each column.
data[data == "?"] <- NA
skimmed <- skim_to_wide(data)
knitr::kable(skimmed[, c(1:3, 5:6, 9:10, 16)])
| type | variable | missing | n | n_unique | mean | sd | hist |
|---|---|---|---|---|---|---|---|
| factor | A1Cresult | 0 | 71518 | 4 | NA | NA | NA |
| factor | acarbose | 0 | 71518 | 3 | NA | NA | NA |
| factor | acetohexamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | age | 0 | 71518 | 10 | NA | NA | NA |
| factor | change | 0 | 71518 | 2 | NA | NA | NA |
| factor | chlorpropamide | 0 | 71518 | 4 | NA | NA | NA |
| factor | citoglipton | 0 | 71518 | 1 | NA | NA | NA |
| factor | diabetesMed | 0 | 71518 | 2 | NA | NA | NA |
| factor | diag_1 | 11 | 71518 | 696 | NA | NA | NA |
| factor | diag_2 | 294 | 71518 | 725 | NA | NA | NA |
| factor | diag_3 | 1225 | 71518 | 758 | NA | NA | NA |
| factor | examide | 0 | 71518 | 1 | NA | NA | NA |
| factor | gender | 0 | 71518 | 3 | NA | NA | NA |
| factor | glimepiride | 0 | 71518 | 4 | NA | NA | NA |
| factor | glimepiride.pioglitazone | 0 | 71518 | 1 | NA | NA | NA |
| factor | glipizide | 0 | 71518 | 4 | NA | NA | NA |
| factor | glipizide.metformin | 0 | 71518 | 2 | NA | NA | NA |
| factor | glyburide | 0 | 71518 | 4 | NA | NA | NA |
| factor | glyburide.metformin | 0 | 71518 | 4 | NA | NA | NA |
| factor | insulin | 0 | 71518 | 4 | NA | NA | NA |
| factor | max_glu_serum | 0 | 71518 | 4 | NA | NA | NA |
| factor | medical_specialty | 34477 | 71518 | 70 | NA | NA | NA |
| factor | metformin | 0 | 71518 | 4 | NA | NA | NA |
| factor | metformin.pioglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | metformin.rosiglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | miglitol | 0 | 71518 | 4 | NA | NA | NA |
| factor | nateglinide | 0 | 71518 | 4 | NA | NA | NA |
| factor | payer_code | 31043 | 71518 | 17 | NA | NA | NA |
| factor | pioglitazone | 0 | 71518 | 4 | NA | NA | NA |
| factor | race | 1948 | 71518 | 5 | NA | NA | NA |
| factor | readmitted | 0 | 71518 | 3 | NA | NA | NA |
| factor | repaglinide | 0 | 71518 | 4 | NA | NA | NA |
| factor | rosiglitazone | 0 | 71518 | 4 | NA | NA | NA |
| factor | tolazamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | tolbutamide | 0 | 71518 | 2 | NA | NA | NA |
| factor | troglitazone | 0 | 71518 | 2 | NA | NA | NA |
| factor | weight | 68665 | 71518 | 9 | NA | NA | NA |
| integer | admission_source_id | 0 | 71518 | NA | 5.66 | 4.16 | ▅▇▁▁▁▁▁▁ |
| integer | admission_type_id | 0 | 71518 | NA | 2.1 | 1.51 | ▇▃▃▁▁▁▁▁ |
| integer | discharge_disposition_id | 0 | 71518 | NA | 3.59 | 5.27 | ▇▂▁▁▁▁▁▁ |
| integer | encounter_id | 0 | 71518 | NA | 1.6e+08 | 1e+08 | ▅▇▇▅▃▂▁▁ |
| integer | num_lab_procedures | 0 | 71518 | NA | 43.08 | 19.95 | ▃▃▇▆▂▁▁▁ |
| integer | num_medications | 0 | 71518 | NA | 15.71 | 8.31 | ▆▇▂▁▁▁▁▁ |
| integer | num_procedures | 0 | 71518 | NA | 1.43 | 1.76 | ▇▃▂▂▁▁▁▁ |
| integer | number_diagnoses | 0 | 71518 | NA | 7.25 | 1.99 | ▁▂▅▃▇▁▁▁ |
| integer | number_emergency | 0 | 71518 | NA | 0.1 | 0.51 | ▇▁▁▁▁▁▁▁ |
| integer | number_inpatient | 0 | 71518 | NA | 0.18 | 0.6 | ▇▁▁▁▁▁▁▁ |
| integer | number_outpatient | 0 | 71518 | NA | 0.28 | 1.07 | ▇▁▁▁▁▁▁▁ |
| integer | patient_nbr | 0 | 71518 | NA | 5.5e+07 | 3.9e+07 | ▇▇▅▆▅▁▁▁ |
| integer | time_in_hospital | 0 | 71518 | NA | 4.29 | 2.95 | ▇▇▂▃▂▁▁▁ |
Now, we will drop some of the columns for various reasons: - Encounter ID: an uneccessary column that provides no information for readmission - Patient Number: an unecessary column that provides no information for readmission - Weight: While potentially useful, there is too many missing values - Payer Code: too many missing values - Medical Specialty: probably unecessary and too many missing values - Diabetes Medication: redundant information - diag_2 and diag_3: ICD codes are hard to deal with and often lack predictive power
drops <- c("encounter_id", "patient_nbr", "weight", "payer_code",
"medical_specialty", "diabetesMed", "diag_2", "diag_3")
data <- data[ , !(names(data) %in% drops)]
Next, we need to drop any patients who’s discharge status is listed as “expired”. If a patient has died, there is no chance of readmission.
data <- filter(data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))
We will also remove the 3 subjects with an unknon gender.
data <- data %>% filter(gender != "Unknown/Invalid")
data$gender <- droplevels(data$gender)
data$gender <- as.integer(data$gender) - 1
# Encoding ages
data$age <- ifelse(data$age == "[0-10)", 5,
ifelse(data$age == "[10-20)", 15,
ifelse(data$age == "[20-30)", 25,
ifelse(data$age == "[30-40)", 35,
ifelse(data$age == "[40-50)", 45,
ifelse(data$age == "[50-60)", 55,
ifelse(data$age == "[60-70)", 65,
ifelse(data$age == "[70-80)", 75,
ifelse(data$age == "[80-90)", 85, 95)))))))))
data$max_glu_serum <- ifelse(data$max_glu_serum == "None", 0,
ifelse(data$max_glu_serum == "Norm", 1, 2))
data$A1Cresult <- ifelse(data$A1Cresult == "None", 0,
ifelse(data$A1Cresult == "Norm", 1,
ifelse(data$A1Cresult %in% c(">7", ">8"), 2, NA)))
data$insulin <- ifelse(data$insulin == "No", 0, ifelse(data$insulin == "Down", 1, ifelse(data$insulin == "Steady", 2, 3)))
data$change <- ifelse(data$change == "Ch", 1, 0)
data$race <- ifelse(data$race == "Caucasian", 0, ifelse(data$race == "AfricanAmerican", 1, 2))
data$race[is.na(data$race)] <- 2
data$readmitted <- ifelse((data$readmitted == "NO" | data$readmitted == ">30"), 0, 1)
data %>% group_by(readmitted) %>% summarize(count=n())
## # A tibble: 2 x 2
## readmitted count
## <dbl> <int>
## 1 0 64138
## 2 1 6293
# Sum of number of outpatient, emergency and inpatient encounters
data$healthcare_utilization <- (data$number_outpatient + data$number_emergency + data$number_inpatient)
# Creating sum of other meds on column
# (not all of these columns have all four options, but it was easier to code this way - gives same result)
data$metformin <- ifelse(data$metformin %in% c("Up", "Steady", "Down"),1,0)
data$repaglinide <- ifelse(data$repaglinide %in% c("Up", "Steady", "Down"),1,0)
data$nateglinide <- ifelse(data$nateglinide %in% c("Up", "Steady", "Down"),1,0)
data$chlorpropamide <- ifelse(data$chlorpropamide %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride <- ifelse(data$glimepiride %in% c("Up", "Steady", "Down"),1,0)
data$acetohexamide <- ifelse(data$acetohexamide %in% c("Up", "Steady", "Down"),1,0)
data$glipizide <- ifelse(data$glipizide %in% c("Up", "Steady", "Down"),1,0)
data$glyburide <- ifelse(data$glyburide %in% c("Up", "Steady", "Down"),1,0)
data$tolbutamide <- ifelse(data$tolbutamide %in% c("Up", "Steady", "Down"),1,0)
data$pioglitazone <- ifelse(data$pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$rosiglitazone <- ifelse(data$rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$acarbose <- ifelse(data$acarbose %in% c("Up", "Steady", "Down"),1,0)
data$miglitol <- ifelse(data$miglitol %in% c("Up", "Steady", "Down"),1,0)
data$troglitazone <- ifelse(data$troglitazone %in% c("Up", "Steady", "Down"),1,0)
data$tolazamide <- ifelse(data$tolazamide %in% c("Up", "Steady", "Down"),1,0)
data$examide <- ifelse(data$examide %in% c("Up", "Steady", "Down"),1,0)
data$citoglipton <- ifelse(data$citoglipton %in% c("Up", "Steady", "Down"),1,0)
data$glyburide.metformin <- ifelse(data$glyburide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glipizide.metformin <- ifelse(data$glipizide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride.pioglitazone <- ifelse(data$glimepiride.pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.rosiglitazone <- ifelse(data$metformin.rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.pioglitazone <- ifelse(data$metformin.pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$num_other_diabetes_meds_up_stdy_dwn <- (data$metformin + data$repaglinide + data$nateglinide + data$chlorpropamide + data$glimepiride + data$acetohexamide + data$glipizide + data$glyburide + data$tolbutamide + data$pioglitazone + data$rosiglitazone + data$acarbose + data$miglitol + data$troglitazone + data$tolazamide + data$examide + data$citoglipton + data$glyburide.metformin + data$glipizide.metformin + data$glimepiride.pioglitazone + data$metformin.rosiglitazone + data$metformin.pioglitazone)
drops2 <- c("metformin", "repaglinide", "nateglinide", "chlorpropamide", "glimepiride", "acetohexamide", "glipizide",
"glyburide", "tolbutamide", "pioglitazone", "rosiglitazone", "acarbose", "miglitol", "troglitazone",
"tolazamide", "examide", "citoglipton", "glyburide.metformin", "glipizide.metformin", "glimepiride.pioglitazone",
"metformin.rosiglitazone", "metformin.pioglitazone")
data <- data[ , !(names(data) %in% drops2)]
# set up dummy variables for admission types
data <- mutate(data, at_emergent = as.numeric(admission_type_id %in% c(1, 2, 7)),
at_elective = as.numeric(admission_type_id == 3),
at_other = as.numeric(admission_type_id %in% c(4, 5, 6, 8)))
# Set up dummy variables for discharge dispositions
data <- mutate(data, dd_home = as.numeric(discharge_disposition_id %in% c(1, 6, 8)),
dd_facility_transfer = as.numeric(discharge_disposition_id %in% c(2, 3, 4, 5, 10, 16, 17, 22, 23, 24, 30, 27, 28, 29, 13, 14)),
dd_other = as.numeric(discharge_disposition_id %in% c(7, 18, 25, 26, 9, 12, 15)))
# dd_admitted = as.numeric(discharge_disposition_id %in% c(9, 12, 15)), #lumping admitted into other
# add_expired = as.numeric(discharge_disposition_id %in% c(11, 19, 20, 21)), (dropped patients who expired)
# dd_hospice = as.numeric(discharge_disposition_id %in% c(13, 14))) #lumping hospice into transfer
# Set up dummy variables for admission source
data <- mutate(data, as_outpatient = as.numeric(admission_source_id %in% c(1, 2, 3)),
as_facility_transfer = as.numeric(admission_source_id %in% c(4, 5, 6, 10, 18, 19, 22, 25, 26)),
as_ed = as.numeric(admission_source_id %in% c(7)),
as_other = as.numeric(admission_source_id %in% c(8, 9, 15, 17, 20, 21, 11, 12, 13, 14, 23, 24)))
#as_newborn = as.numeric(admission_source_id %in% c(11, 12, 13, 14, 23, 24)))#as_newborn added to as_other
data <- within(data, rm(admission_type_id))
data <- within(data, rm(discharge_disposition_id))
data <- within(data, rm(admission_source_id))
data$diag_1 <- as.character(data$diag_1)
data$diag_1grp <- ifelse (data$diag_1 == "?", "Unknown",
ifelse(grepl(data$diag_1, pattern = "[EV]") == T, "External",
ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory",
ifelse(data$diag_1 %in% c(390:459, 785), "Diabetes",
ifelse(data$diag_1 %in% c(460:519, 786), "Respiratory",
ifelse(data$diag_1 %in% c(520:579, 787), "Digestive",
ifelse(data$diag_1 %in% c(800:999), "Injury",
ifelse(data$diag_1 %in% c(710:739), "Musculoskeletal",
ifelse(data$diag_1 %in% c(580:629, 788), "Genitourinary",
ifelse(data$diag_1 %in% c(140:239), "Neoplasm",
ifelse(data$diag_1 %in% c(780, 781, 784, 790:799, 740:759), "Other", # added congenital to other
ifelse(data$diag_1 %in% c(240:249, 251:279), "Endocrine_Nutrition_Metabolic",
ifelse(data$diag_1 %in% c(680:709, 782), "Skin",
ifelse(data$diag_1 %in% c(001:139), "Infectious",
ifelse(data$diag_1 %in% c(290:319), "Mental",
ifelse(data$diag_1 %in% c(280:289), "Blood",
ifelse(data$diag_1 %in% c(320:359), "Nervous",
ifelse(data$diag_1 %in% c(630:679), "Pregnancy",
ifelse(data$diag_1 %in% c(360:389), "Sense", "Unknown")
#ifelse(data$diag_1 %in% c(740:759), "Congenital", "NULL")
))))))))))))))))))
## Warning in ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", :
## NAs introduced by coercion
data$diag_1grp <- as.factor(data$diag_1grp)
data <- one_hot(data.table(data))
data <- as.data.frame(data) # convert back to data.frame format
data <- within(data, rm(diag_1))
data <- na.omit(data)
data <- data %>% select(-readmitted, readmitted) # move readmitted to the last column value
cor_mat <- cor(data[, 1:46])
cor_mat[!lower.tri(cor_mat)] <- 0
data <- data[,!apply(cor_mat,2,function(x) any(abs(x) > 0.6))]
cor_mat2 <- cor(data[, 1:41])
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
cor_mat_melted <- melt(get_lower_tri(cor_mat2), na.rm = T)
ggplot(data = cor_mat_melted, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
legend.position = c(0,1), legend.justification = c(0,1),
axis.title = element_blank(), plot.title = element_text(hjust = 0.5)) +
labs(title = "Correlation matrix") +
scale_y_discrete(position = "right")
#corrplot(cor_mat2, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
The dataset is now ready, and we will split it into training(80%) and test(20%) datasets using caret’s createDataPartition function.
# write cleaned dataset
write.csv(data, file = "dataset/clean_diabetic_data.csv", row.names = F)
# Create the training and test datasets
set.seed(123)
# Step 1: Get row numbers for the training data (80% train split, 20% testing split)
trainRowNumbers <- createDataPartition(data$readmitted, p=0.8, list=FALSE)
# Step 2: Create the training dataset
trainData <- data[trainRowNumbers,]
# Step 3: Create the test dataset
testData <- data[-trainRowNumbers,]
# Step 4: Save the datasets
write.csv(trainData, file = "dataset/trainData.csv", row.names = F)
write.csv(testData, file = "dataset/testData.csv", row.names = F)
pca_tsne_df <- data[,c(1:41)]
# scale variables
for(i in 1:ncol(pca_tsne_df)){
pca_tsne_df[[i]] <- ((pca_tsne_df[[i]] - mean(pca_tsne_df[[i]])) / sd(pca_tsne_df[[i]]))
}
pca_tsne_df <- t(pca_tsne_df)
#PCA
PC <- prcomp(pca_tsne_df)
var_explained <- data.frame("PC" = c(1:41), "eigenvalue" = (PC$sdev)^2)
var_explained$var_explained <- var_explained$eigenvalue / sum(var_explained$eigenvalue)
var_explained$cumulative <- cumsum(var_explained$var_explained) / sum(var_explained$var_explained)
ggplot(var_explained, aes(x = PC, y = var_explained)) +
geom_line() +
labs(title = "Elbow plot: proportion of variance explained by each PC")
ggplot(var_explained, aes(x = PC, y = cumulative)) +
geom_line() +
labs(title = "Cumulative variance explained by PC_1 through PC_n",
x = "n", y = "vat_explained")
outdf <- cbind(data, PC$rotation)
outdf$readmitted <- as.factor(outdf$readmitted)
ggplot(outdf, aes(x = PC1, y = PC2)) +
geom_point(aes(color = readmitted), alpha = 0.5) +
scale_color_manual(values = c("cadetblue3", "blue")) +
labs(title = "Classes are not well-separated by principal components")
# TSNE
tsne_model_1 = Rtsne(t(as.matrix(pca_tsne_df)), check_duplicates=FALSE, pca=TRUE,perplexity = 10, theta=0.5, dims=2)
## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y)
d_tsne_1$readmitted <- as.factor(data$readmitted)
## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +
geom_point(aes(color = readmitted)) +
scale_color_manual(values = c("cadetblue3", "blue")) +
labs(title = "Classes are not well-separated by t-SNE")
We do not see separation of readmitted classes by PCA or t-SNE analysis.
ggplot(data=data, aes(x=factor(race, labels = c("Caucasian", "African American", "Other")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Race", x="Race", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(gender, labels = c("Female", "Male")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Gender", x="Gender", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(A1Cresult, labels = c("No Measurement", "Normal Reading", "Abnormal")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of A1C Test Result", x="A1C", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(insulin, labels = c("No Insulin", "Decrease in Insulin", "Steady Insulin", "Increase in Insulin")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Insulin Change", x="Insulin Change", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=factor(change, labels = c("No Change", "Change in Medication")),
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_bar(position=position_dodge(), color = "black") +
labs(title="Categorywise Bar Chart of Medication Change", x="Medication Change", y="Count", fill="Readmission Status")
ggplot(data=data, aes(x=time_in_hospital,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="Time in Hospital", y="Density",
title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=time_in_hospital)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Time in Hospital", x="Readmission Status",
title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_lab_procedures,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Lab Proedures", y="Density",
title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_lab_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Lab Procedures", x="Readmission Status",
title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_procedures,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Non-Lab Proedures", y="Density",
title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Non-Lab Procedures", x="Readmission Status",
title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_medications,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Medications Administered", y="Density",
title="Distribution of # of Medications Administered Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_procedures)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Medications Administered", x="Readmission Status",
title="Distribution of # Medications Administered Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=number_diagnoses,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Diagnoses", y="Density",
title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=number_diagnoses)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Diagnoses", x="Readmission Status",
title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=age,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="End of Age Interval", y="Density",
title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=age)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="End of Age Interval", x="Readmission Status",
title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=healthcare_utilization,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="Healtchare Utilization", y="Density",
title="Distribution of Healthcare Utilization by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=healthcare_utilization)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="Healthcare Utilization", x="Readmission Status",
title="Distribution of Healtchare Utilization Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=num_other_diabetes_meds_up_stdy_dwn,
fill=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
geom_density(alpha=0.5) +
labs(x="# of Diabetes Medications Changed", y="Density",
title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")
ggplot(data=data, aes(x=factor(readmitted,
labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")),
y=num_other_diabetes_meds_up_stdy_dwn)) +
geom_boxplot(fill="firebrick1", alpha=0.6) +
labs(y="# of Diabetes Medications Changed", x="Readmission Status",
title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")
“SMOTE is an oversampling method which creates new instances of the minority class by forming convex combinations of neighboring instances. It effectively draws lines between minority points in the feature space, and samples along these lines. This allows us to balance our data-set without as much overfitting, as we create new synthetic examples rather than using duplicates. This however does not prevent all overfitting, as these are still created from existing data points.” (https://towardsdatascience.com/dealing-with-imbalanced-classes-in-machine-learning-d43d6fa19d2)
if(file.exists("dataset/trainSMOTE.csv")) {
trainData_SMOTE <- read.csv("dataset/trainSMOTE.csv")
} else {
trainData$readmitted <- as.factor(trainData$readmitted)
trainData_SMOTE <- SMOTE(readmitted ~ ., trainData)
write.csv(trainData_SMOTE, "dataset/trainSMOTE.csv", row.names = F)
}
XGBoost stands for eXtreme Gradient Boosting. You can read more about it here.
# create XGB Boost dense matrix objects --> faster performance
dtrain <- xgb.DMatrix(data = as.matrix(trainData_SMOTE[, 1:41]), label = as.numeric(trainData_SMOTE$readmitted))
dtest <- xgb.DMatrix(data = as.matrix(testData[, 1:41]), label = as.numeric(testData$readmitted))
In machine learning, hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.
if(file.exists("xgb_params.csv")) {
best_params <- read.csv("xgb_params.csv")
} else {
param_grid <- expand.grid(objective = "binary:logistic",
eval_metric = "auc",
eta = c(0.1, 0.3, 0.5),
max_depth = c(2,4,6),
gamma = c(2,3,4),
scale_pos_weight = c(2, 3))
best_test_auc <- 0
best_params <- param_grid[1, ]
for (i in 1:nrow(param_grid)) {
print(i)
xgb_cv <- xgb.cv(params = param_grid[i, ], data = dtrain, nrounds = 150,
nfold = 10, early_stopping_rounds = 5, verbose = F)
if(xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] > best_test_auc) {
best_test_auc = xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean]
best_params <- param_grid[i, ]
}
}
write.csv(best_params, "xgb_params.csv", row.names = F)
}
xgb <- xgb.train(params = best_params, data = dtrain, nrounds = 500)
xgbPredict <- predict(xgb, dtest)
xgbPredict <- as.numeric(xgbPredict> 0.5)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(xgbPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9761 753
## 1 3033 537
##
## Accuracy : 0.7312
## 95% CI : (0.7238, 0.7385)
## No Information Rate : 0.9084
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0999
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.41628
## Specificity : 0.76294
## Pos Pred Value : 0.15042
## Neg Pred Value : 0.92838
## Precision : 0.15042
## Recall : 0.41628
## F1 : 0.22099
## Prevalence : 0.09159
## Detection Rate : 0.03813
## Detection Prevalence : 0.25348
## Balanced Accuracy : 0.58961
##
## 'Positive' Class : 1
##
importance <- xgb.importance(model = xgb)
xgb.ggplot.importance(importance_matrix = importance, top_n = 10)
ROC is a probability curve and AUC represents degree or measure of separability. It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s. You can read more about it here.
pred <- prediction(xgbPredict, testData$readmitted)
AUC <- performance(pred, "auc")@y.values[[1]]
perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(AUC, digits=3, scientific=FALSE)))
ctrl <- trainControl(method="cv", number = 5)
knn <- train(as.factor(readmitted) ~ ., data = trainData_SMOTE, method = "knn", metric = "kappa",
trControl = ctrl, preProcess = c("center","scale"), tuneLength = 10)
## Warning in train.default(x, y, weights = w, ...): The metric "kappa" was
## not in the result set. Accuracy will be used instead.
knn
## k-Nearest Neighbors
##
## 35007 samples
## 41 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (41), scaled (41)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 28005, 28006, 28006, 28006, 28005
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7256547 0.4574105
## 7 0.7103434 0.4251942
## 9 0.6953464 0.3938713
## 11 0.6867195 0.3750039
## 13 0.6782070 0.3566975
## 15 0.6700658 0.3392946
## 17 0.6662666 0.3305906
## 19 0.6632959 0.3237004
## 21 0.6580968 0.3121660
## 23 0.6524122 0.2997506
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
plot(knn)
knnPredict <- predict(knn, testData)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(knnPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8111 704
## 1 4683 586
##
## Accuracy : 0.6175
## 95% CI : (0.6094, 0.6255)
## No Information Rate : 0.9084
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.037
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.45426
## Specificity : 0.63397
## Pos Pred Value : 0.11122
## Neg Pred Value : 0.92014
## Precision : 0.11122
## Recall : 0.45426
## F1 : 0.17869
## Prevalence : 0.09159
## Detection Rate : 0.04161
## Detection Prevalence : 0.37411
## Balanced Accuracy : 0.54412
##
## 'Positive' Class : 1
##
knnPred <- prediction(as.numeric(knnPredict), testData$readmitted)
knn_AUC <- performance(knnPred, "auc")@y.values[[1]]
knn_perf_ROC=performance(knnPred,"tpr","fpr") #plot the actual ROC curve
plot(knn_perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(knn_AUC, digits=3, scientific=FALSE)))